Abstract: In this paper, we consider delay-optimal power
and subcarrier allocation design for OFDMA systems with $N_F$
subcarriers, $K$ mobiles and one base station. There are $K$ queues
at the base station for the downlink traffic to the $K$ mobiles
with heterogeneous packet arrivals and delay requirements. We
model the problem as a $K$-dimensional infinite horizon
average reward Markov Decision Problem (MDP) where the
control actions are assumed to be a function of the instantaneous
Channel State Information (CSI) as well as the joint Queue
State Information (QSI). This problem is challenging because it
corresponds to a stochastic Network Utility Maximization (NUM)
problem, for which a general solution is still unknown. We propose
an online stochastic value iteration solution using stochastic
approximation. The proposed power control algorithm, which
is a function of both the CSI and the QSI, takes the form of
multi-level water-filling. We prove that under two mild conditions
in Theorem 1 (a stepsize condition and an accessibility condition
on the Markov chain, which is easily satisfied in most cases of
interest), the proposed solution converges to the optimal solution
almost surely (with probability 1), and the proposed framework
offers a possible solution to the general stochastic NUM problem.
By exploiting the birth-death structure of the queue dynamics, we
obtain a reduced complexity decomposed solution with linear
$O(KN_F)$ complexity and $O(K)$ memory requirement.



I. Introduction 

There is a large body of literature on cross-layer optimization
of power and subband allocation in OFDMA systems [1],
[2]. Yet, all these works focused on optimizing the physical 
layer performance and the power/subband allocation solutions 
derived are all functions of the channel state information 
(CSI) only. On the other hand, real life applications are delay- 
sensitive and it is critical to consider the delay performance 
in addition to the conventional physical layer performance in 
OFDMA cross-layer design. A combined framework that takes
into account both queueing delay and physical layer performance
is not trivial, as it involves both queueing theory (to
model the queue dynamics) and information theory (to model
the physical layer dynamics). One approach converts the
delay constraint into an average rate constraint using the tail
probability in the large delay regime and solves the optimization
problem using an information theoretical formulation based on the
rate constraint [3]-[5]. While this approach allows potentially
simple solutions, the derived control policy is a function of the
channel state information (CSI) only. In general, delay-optimal
control actions should be a function of both the CSI and 
queue state information (QSI). In [6], [7], the authors showed 
that LQHPR policy is delay optimal for multiaccess fading 
channels. However, the solution utilizes stochastic majoriza- 
tion theory which requires symmetry among the users and is 
difficult to extend to other situations. In [8]-[10], the authors 
studied the queue stability region of various wireless systems 
using Lyapunov drift. Under the assumption that all queues
are large enough, the GPD (Greedy Primal-Dual) algorithm [11]
and the RT-SPD (Real Time Stochastic Primal-Dual) algorithm
[12] were proposed to solve utility-based optimization problems
under a queueing network stability constraint and an average delay
constraint, respectively.

While all the above works addressed different aspects of 
the delay sensitive resource allocation problem, there are still 
a number of first order issues to be addressed. In this paper, 
we consider an OFDMA wireless system with $N_F$ subcarriers,
K mobiles and a base station. There are K queues for the 
mobiles at the base station with heterogeneous arrivals and 
departures. The delay-optimal power and subcarrier allocation 
actions, which minimize the average delay of the K MSs under 
the average total power constraint and subcarrier allocation 
constraint, are functions of both the CSI and the joint QSI. 
We shall elaborate the major challenges behind this problem 
below. 

• The Curse of Dimensionality: A more general approach
is to model the problem as a Markov Decision Problem
(MDP) [13], [14]. However, a primary difficulty in determining
the optimal policy using the MDP approach is the
huge state space involved. For instance, the state space
is exponentially large^1 in the number of users. Hence,
a brute force solution by value iteration or policy iteration
is not applicable due to the huge complexity and memory
requirement involved.

• Issues of Convergence in Stochastic NUM Problem:
In conventional iterative solutions for deterministic NUM
problems, the updates in the iterative algorithms (such as
subgradient search) are performed within the coherence
time of the CSI (the CSI remains quasi-static during the
iteration updates)^2 [15]. In stochastic NUM, however,
the updates are performed over the ergodic realizations
of the system states, and hence the convergence proof
is challenging (such as GPD in [11] and RT-SPD in
[12]). When we consider the delay-optimal problem,
the problem is stochastic and the control actions are
defined over the ergodic realizations of the system states
(CSI, QSI). Therefore, the convergence proof is also quite
challenging.

• Heterogeneous Users: Some works have obtained
a simple delay-optimal policy for multiple access
channels by using majorization theory and exploiting
symmetry between users [6], [7]. However, such simplifications
cannot be extended easily to our case, in which
users have heterogeneous arrivals and delay requirements.

^1 For a system with 4 users, 6 subcarriers, a buffer size of 50 per user and
4 channel states, the system state space contains $4^{4 \times 6} \times (50 + 1)^4$
states, which is already unmanageable.

^2 This poses a serious limitation on the practicality of the distributive
iterative solutions because the convergence and the optimality of the iterative
solutions are not guaranteed if the CSI changes significantly during the update.
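To make the curse of dimensionality concrete, the following minimal sketch counts the joint (CSI, QSI) states. It assumes, as the footnote example appears to, that each user-subcarrier pair independently takes one of a finite number of channel states; the function name `joint_state_count` and its parameters are hypothetical and not part of the paper.

```python
def joint_state_count(num_users, num_subcarriers, buffer_size, channel_states):
    """Count joint (CSI, QSI) states for a finite-state channel model.

    Assumption (not from the paper's text): every user-subcarrier pair
    takes one of `channel_states` values, and every queue holds between
    0 and `buffer_size` packets.
    """
    csi_states = channel_states ** (num_users * num_subcarriers)
    qsi_states = (buffer_size + 1) ** num_users
    return csi_states * qsi_states


# The footnote example: 4 users, 6 subcarriers, buffer size 50, 4 channel states.
print(joint_state_count(4, 6, 50, 4))
```

Adding a single user multiplies the count by both the per-user CSI factor and the per-user QSI factor, which is why table-based value iteration over the joint state quickly becomes impractical.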

In this paper, we shall address the above issues by proposing
a low-complexity solution to the delay-optimal OFDMA system.
To address the open issue concerning the huge complexity
involved in solving the $K$-dimensional MDP, we utilize
stochastic approximation (SA) techniques [16], [17] to derive
a low complexity online stochastic value iteration algorithm.
We shall show that under some mild conditions, the proposed
online stochastic value iteration algorithm converges to the
optimal solution almost surely (with probability 1). By exploiting
the birth-death structure of the queue dynamics, we
obtain a reduced complexity decomposed solution with linear
$O(KN_F)$ complexity and $O(K)$ memory requirement.

II. System Models 

In this section, we shall elaborate the system model, the 
OFDMA physical layer model as well as the underlying queueing
model. Fig. 1 illustrates the top level system model. There
are K user queues at the base station which buffer packets 
for the K mobile users in the OFDMA system. These K 
application streams may have different source arrival rates and 
delay requirements and this corresponds to a heterogeneous 
user situation. The base station has a cross-layer scheduler 
which takes the CSI and joint QSI as the inputs and produces 
a power allocation and subcarrier allocation action as outputs. 




Fig. 1. OFDMA physical layer and queueing model.



A. OFDMA Physical Layer Model 

We consider an OFDMA system with $N_F$ subcarriers over
a frequency selective fading channel with $L$ independent
multipaths. As a result, the received signal at the $k$-th mobile
and $n$-th subcarrier is given by $Y_{k,n} = H_{k,n} X_{k,n} + Z_{k,n}$,
where $X_{k,n}$ is the transmitted symbol and $H_{k,n}, Z_{k,n} \sim
\mathcal{CN}(0, 1)$ are the fading coefficient (CSI) and noise respectively.
Let $s_{k,n} \in \{0, 1\}$ denote the subcarrier allocation index for
the $k$-th user at the $n$-th subcarrier. For simplicity, we assume
powerful channel coding (such as LDPC) at the transmitter.
Furthermore, the mobile receiver has perfect knowledge of the
CSI. Hence, the maximum achievable data rate of the $k$-th user
is given by the mutual information between the channel inputs
$\{X_{k,n} : s_{k,n} = 1\}$ and the channel outputs $\{Y_{k,n} : s_{k,n} = 1\}$,
which is given by:

$$R_k = \sum_{n=1}^{N_F} s_{k,n} I(X_{k,n}; Y_{k,n} | H_{k,n}) = \sum_{n=1}^{N_F} s_{k,n} \log(1 + p_{k,n} |H_{k,n}|^2) \tag{1}$$

B. Queue Model, System Dynamics and Control Policy 

In this paper, the time dimension is partitioned into scheduling
slots indexed by $t$. Let $\tau$ (second/slot) be the slot duration.

Assumption 1: The joint CSI $H(t) = \{H_{k,n}(t) \ \forall n, k\}$ remains
quasi-static within a scheduling slot and is i.i.d. between
scheduling slots^3.

There are $K$ queues at the base station transmitter for the
$K$ mobiles respectively. Data arrives in packets according to
$K$ random arrival processes and each packet is stored in one
of the $K$ queues according to its destination. Let $A(t) =
(A_1(t), \cdots, A_K(t))$ and $N(t) = (N_1(t), \cdots, N_K(t))$ be the
random new packet arrivals at the end of the $t$-th scheduling
slot and the sizes of the packets at the front of the queues
for the $K$ users at the beginning of the $t$-th scheduling slot,
respectively.

Assumption 2: The arrival process $A_k(t)$ and the random
packet size $N_k(t)$ are i.i.d. over scheduling slots.

Let $Q(t) = (Q_1(t), \cdots, Q_K(t))$ be the joint QSI of the $K$-user
OFDMA system, where $Q_k(t)$ denotes the unfinished
work (number of packets) in the $k$-th queue at the beginning of
the $t$-th slot. Let $R(t) = (R_1(t), \cdots, R_K(t))$ be the scheduled
data rates (bit/second) of the $K$ users, where $R_k(t)$ is given
by (1). At the beginning of the $t$-th scheduling slot, the cross-layer
scheduler observes the CSI $H(t)$ and the joint QSI
$Q(t)$ and calculates the scheduled data rate $R(t)$. We assume
the scheduler at the transmitter is causal in the sense that the
new packet arrivals $A(t)$ at the $t$-th slot appear after the
scheduler's action. Hence, the queue dynamics are given by
the following equation.

$$Q_k(t+1) = \min\left\{ \left[ Q_k(t) - R_k(t)\tau / N_k(t) \right]^+ + A_k(t),\ N_Q \right\} \tag{2}$$

where $x^+ = \max\{x, 0\}$ and $N_Q$ denotes the maximum buffer size
(number of packets). Thus, the cardinality of the joint QSI is
$I = (N_Q + 1)^K$, which grows exponentially with $K$.

^3 The quasi-static assumption is realistic for pedestrian mobility users where
the channel coherence time is around 50 ms but the typical frame duration is less
than 5 ms in next generation wireless systems such as WiMAX. On the other
hand, we assume the CSI is i.i.d. between slots (as in much of the literature)
in order to capture first order insights. A similar solution framework can be
applied to deal with FSMC fading.
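The queue recursion in (2) can be sketched in a few lines. This is only an illustration under the stated model; the function name `next_queue_length` and the argument names are hypothetical, and the served amount is kept as a real number exactly as (2) is written.

```python
def next_queue_length(q, rate, tau, packet_size, arrivals, buffer_size):
    """One-slot queue update following (2).

    q           : current backlog Q_k(t) in packets
    rate        : scheduled rate R_k(t) in bit/second
    tau         : slot duration in seconds
    packet_size : size N_k(t) of the head-of-line packet in bits
    arrivals    : new packets A_k(t) arriving at the end of the slot
    buffer_size : maximum buffer size N_Q in packets
    """
    served = rate * tau / packet_size        # packets served in this slot
    remaining = max(q - served, 0.0)         # the [.]^+ operation
    return min(remaining + arrivals, buffer_size)
```

Because arrivals are added after the service term, the sketch reflects the causality assumption: packets arriving in slot $t$ cannot be served in the same slot.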

For notational convenience, we denote $\chi(t) = (H(t), Q(t))$
to be the global system state at the $t$-th slot. Given an
observed system state realization $\chi$, the transmitter may adjust
the transmit power and subcarrier allocation according to 
a stationary power control and subcarrier allocation policy 
defined below. 

Definition 1 (Stationary Power Control and Subcarrier Allocation
Policy): A stationary transmit power and subcarrier
allocation policy $\Omega = (\Omega_p, \Omega_s)$ is a mapping from the system
state $\chi$ to the power and subcarrier allocation actions. A policy
$\Omega$ is called feasible if the associated actions satisfy the average
total transmit power constraint and the subcarrier assignment
constraint. Specifically, $\Omega_p(\chi) = p$ and $\Omega_s(\chi) = s$, where
$p = \{p_{k,n}\}$ are the power allocation actions satisfying

$$\sum_{k=1}^{K} \sum_{n=1}^{N_F} E[p_{k,n}] \le P_0, \quad p_{k,n} \ge 0 \tag{3}$$

and $s = \{s_{k,n} \in \{0, 1\}\}$ are the subcarrier allocation actions
satisfying

$$\sum_{k=1}^{K} s_{k,n} = 1 \quad \forall n \in \{1, N_F\} \tag{4}$$



Note that the $K$ queues are coupled together via the control
policy $\Omega$ and the constraints in (3) and (4). The goal of the
scheduler at the transmitter is to choose an optimal stationary
feasible policy $\Omega^*$ that minimizes the average delays of the
$K$ users. Assume that the arrival rate vector falls inside
the stability region of the system [10]. Given a feasible
unichain policy $\Omega$, the induced Markov chain $\{\chi(t)\}$ is ergodic
and there exists a unique steady state distribution $\pi_\chi$ where
$\pi_\chi(\chi_0) = \lim_{t \to \infty} \Pr[\chi(t) = \chi_0]$. Hence, by Little's Law
[18], the average delay of the $k$-th user under a policy $\Omega$ is
given by:

$$T_k(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \frac{E[Q_k(t)]}{\lambda_k} = \frac{E_{\pi_\chi}[Q_k]}{\lambda_k}, \quad \forall k \in \{1, K\} \tag{5}$$

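where $E_{\pi_\chi}$ denotes expectation w.r.t. the underlying measure
$\pi_\chi$. Similarly, the average transmit power constraint in (3)
is given by

$$P_{tx}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\Big[\sum_{k,n} p_{k,n}(t)\Big] = E_{\pi_\chi}\Big[\sum_{k,n} p_{k,n}\Big] \le P_0 \tag{6}$$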

The average delay is related to the control actions (power and
subcarrier allocation) via the packet service rates $R(t)$ in (1)
and the queue dynamics in (2). The delay-optimal scheduler
design can be formulated as the following optimization problem^4.

^4 We can apply exactly the same solution framework to the optimization
problem which maximizes the physical layer throughput under the average
delay constraint and the average power constraint because the Lagrangian
function of such a problem has the same form as our delay-optimal problem in
(7) (both belong to the class of infinite horizon MDPs).



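Problem 1 (Delay-Optimal Policy): For some constants
$\beta = (\beta_1, \cdots, \beta_K)$ ($\beta_k > 0 \ \forall k$), we seek to find a stationary
policy $\Omega$ that minimizes

$$J_\beta^\Omega = \sum_{k=1}^{K} \beta_k T_k(\Omega) + \gamma P_{tx}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[ g(\chi(t), \Omega(\chi(t))) \right] \tag{7}$$

where $g(\chi, \{p, s\})$ is the per-stage reward given by:

$$g(\chi, \{p, s\}) = \sum_{k} \beta_k \frac{Q_k}{\lambda_k} + \gamma \sum_{k,n} p_{k,n} \tag{8}$$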



The positive weighting factors $\beta$ in (7) indicate the relative
importance of buffer delay among the $K$ data streams and for
each given $\beta$, the solution to (7) corresponds to a point on the
Pareto optimal delay tradeoff boundary of a multi-objective
optimization problem. The constant $\gamma > 0$ is the Lagrange
multiplier for the average transmit power constraint in (3).
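As a small illustration of how the objective in (7) is built up slot by slot, the sketch below evaluates a per-stage cost of the form in (8): a weighted backlog term (backlog divided by arrival rate, as in Little's Law) plus the Lagrangian power term. The function and argument names are hypothetical, introduced only for this example.

```python
def per_stage_cost(queues, powers, betas, arrival_rates, gamma):
    """Per-stage cost of the form in (8).

    queues        : list of backlogs Q_k (packets), one per user
    powers        : powers[k][n] = p_{k,n}, transmit power per user and subcarrier
    betas         : positive delay weights beta_k
    arrival_rates : average arrival rates lambda_k
    gamma         : Lagrange multiplier of the average power constraint
    """
    delay_term = sum(b * q / lam
                     for b, q, lam in zip(betas, queues, arrival_rates))
    power_term = gamma * sum(sum(row) for row in powers)
    return delay_term + power_term


# Two users, two subcarriers (illustrative numbers only).
cost = per_stage_cost([2, 4], [[1.0, 0.0], [0.5, 0.5]], [1, 1], [2, 4], 0.5)
```

Averaging this quantity along a sample path gives an empirical estimate of the objective for whichever policy generated the path.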

III. Markov Decision Problem Formulation 

In this section, we shall formulate the delay minimization 
problem in (7) as an infinite horizon Markov Decision Problem
(MDP) and discuss the optimality condition. 

A. MDP Formulation 

A stationary control policy $\Omega$ induces a joint distribution for
the random process $\{\chi(t)\}$. Since the system queue level $Q(t)$
evolves according to the system dynamics described in (2) and
the arrival, departure and CSI processes are memoryless,
the random process $\{\chi(t)\}$ is a Markov chain^5 and hence, the
optimization problem in (7) can be modeled as an MDP with
transition probability given by^6

$$\begin{aligned} \Pr[\chi(t+1) | \chi(t), \Omega(\chi(t))] &= \Pr[H(t+1) | \chi(t), \Omega(\chi(t))] \Pr[Q(t+1) | \chi(t), \Omega(\chi(t))] \\ &= \Pr[H(t+1)] \Pr[Q(t+1) | \chi(t), \Omega(\chi(t))] \end{aligned} \tag{9}$$

As a result, the optimizing policy for the MDP in (7) can
be obtained by solving the Bellman equation [13] (Page 308)
recursively w.r.t. $(\theta, \{V(\chi)\})$ as below:

$$\theta + V(\chi) = \min_{u(\chi)} \Big[ g(\chi, u(\chi)) + \sum_{\chi'} \Pr[\chi' | \chi, u(\chi)] V(\chi') \Big] \quad \forall \chi \tag{10}$$

where $u(\chi) = \Omega(\chi) = \{p, s\}$ are the power control and
subcarrier allocation actions taken in state $\chi$ and $g(\chi, u(\chi))$
given by (8) is the per-stage reward when the current state
is $\chi$ and action $u(\chi)$ is taken. If there is a $(\theta, \{V(\chi)\})$
satisfying (10), then $\theta = J_\beta^* = \inf_\Omega J_\beta^\Omega$ is the optimal
average reward per stage and the optimizing policy is given
by $\Omega^*(\chi) = u^*(\chi)$, where $u^*(\chi)$ are the optimizing actions of
(10) at state $\chi$. Furthermore, since the induced Markov chain
is irreducible for any stationary policy $\Omega$ considered, the
solution to (10) is unique up to one additive constant.

^5 Given the current queue state $Q(t)$, the departure $R(t)$ (which is determined
by the control action $\Omega(\chi(t))$) and the arrival $A(t)$, the future queue level
$Q(t+1)$ is independent of the previous system states.

^6 Although the QSI $Q(t+1)$ and the CSI $H(t)$ are correlated via the control
action $\Omega(\chi(t))$, due to the causality of the control action, $H(t+1)$ is
independent of $\chi(t)$ and $Q(t+1)$.

B. Reduced State Bellman Equation 

Instead of working on the global state space $\chi = (H, Q)$,
we shall derive a reduced-state Bellman equation from (10)
using a partitioning of the control policy $\Omega$, which is based
on the partial system state $Q$ only. Specifically, we partition a
unichain policy $\Omega$ into a collection of actions based on the
QSI as follows:

Definition 2 (Conditional Actions): Given a control policy
$\Omega$, we define $\Omega(Q) = \{(p, s) = \Omega(\chi) : \chi = (Q, H) \ \forall H\}$ as
the collection of actions under a given QSI $Q$ for all possible
CSI $H$. The policy $\Omega$ is therefore equal to the union of all the
conditional actions, i.e. $\Omega = \bigcup_Q \Omega(Q)$.

The following lemma summarizes the main result. 

Lemma 1 (Equivalent Reduced-State Bellman Equation):
The control policy obtained by solving the original Bellman
equation in (10) is equivalent to the control policy obtained
by solving the following reduced state Bellman equation.

$$\theta + \widetilde{V}(Q^i) = \min_{u(Q^i)} \Big[ \widetilde{g}(Q^i, u(Q^i)) + \sum_{Q^j} \Pr[Q^j | Q^i, u(Q^i)] \widetilde{V}(Q^j) \Big], \quad 1 \le i \le I \tag{11}$$

where $\widetilde{V}(Q) = E[V(\chi) | Q]$ is the conditional potential
function, $u(Q) = \Omega(Q)$ is the collection of actions
under a given QSI $Q$, $\widetilde{g}(Q, u(Q)) = E[g(\chi, u(\chi)) | Q]$
is the conditional per-stage reward, and $\Pr[Q^j | Q^i, u(Q^i)] =
E\big[\Pr[Q^j | \chi^i, u(\chi^i)] \,\big|\, Q^i\big]$ is the conditional average transition
kernel.

Proof: Please refer to Appendix A for the proof. ■

A solution to (11) is still very complex due to the huge dimensionality
of the state space ($I$ is exponential in $K$), and brute
force value iteration or policy iteration [19] has an exponential
memory size requirement. As a result, it is desirable to obtain
an online and low-complexity solution for the problem.
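For reference, the brute force alternative mentioned above can be sketched as relative value iteration on a generic finite MDP with known costs and transition probabilities. This is a textbook-style illustration rather than the paper's algorithm; the data layout (`costs[i][a]`, `trans[i][a][j]`), the function names and the stopping rule are assumptions made for the example, and the table it stores has one entry per state, which is exactly what becomes exponential in $K$.

```python
def bellman_operator(V, costs, trans):
    """Apply the minimizing Bellman operator to V; return (T(V), greedy policy)."""
    TV, policy = [], []
    for i in range(len(V)):
        best_a, best_val = None, float("inf")
        for a in range(len(costs[i])):
            val = costs[i][a] + sum(p * v for p, v in zip(trans[i][a], V))
            if val < best_val:
                best_a, best_val = a, val
        TV.append(best_val)
        policy.append(best_a)
    return TV, policy


def relative_value_iteration(costs, trans, ref=0, tol=1e-10, max_iter=10000):
    """Iterate V <- T(V) - (T_ref(V) - V(ref)) e until V stops changing."""
    V = [0.0] * len(costs)
    theta, policy = 0.0, [0] * len(costs)
    for _ in range(max_iter):
        TV, policy = bellman_operator(V, costs, trans)
        theta = TV[ref] - V[ref]            # current average-cost estimate
        V_new = [tv - theta for tv in TV]   # keeps V(ref) fixed
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return theta, V_new, policy
        V = V_new
    return theta, V, policy


# Tiny two-state, two-action example (illustrative numbers only).
costs = [[1.0, 2.0], [3.0, 0.5]]
trans = [[[0.5, 0.5], [0.9, 0.1]],
         [[0.5, 0.5], [0.2, 0.8]]]
theta, V, policy = relative_value_iteration(costs, trans)
```

At convergence, `theta` plays the role of $\theta$ in the Bellman equation and `V` is a potential normalized at the reference state.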

IV. General Solution to the Delay Optimal 
Problem 

In this section, we shall derive a low complexity (but 
optimal) solution by proposing an online value iteration to 
solve the reduced state Bellman equation in (11). We shall also
establish technical conditions for almost-sure convergence of 
the online value iteration. 

A. Online value iteration via Stochastic Approximation 

We shall propose an online sample-path-based iterative
learning algorithm to estimate the performance potential and
the control policy. Define a vector mapping $T : \mathbb{R}^I \to \mathbb{R}^I$
with the $i$-th component mapping ($1 \le i \le I$) as

$$T_i(V) = \min_{u(Q^i)} \Big[ \widetilde{g}(Q^i, u(Q^i)) + \sum_{Q^j} \Pr[Q^j | Q^i, u(Q^i)] \widetilde{V}(Q^j) \Big] \tag{12}$$

Since the potential is unique up to a constant, we could
set $T_I(V) - \widetilde{V}(Q^I) = J_\beta$ for some reference state $Q^I$
(one of the $I$ joint queue states). Let $t = \{0, 1, 2, \ldots\}$ be the slot index
and $\{Q(0), \cdots, Q(t), \cdots\}$ be the sample path, i.e. the corresponding
realizations of the system states. To perform the
online value iteration, we divide the sample path into regenerative
periods. A regenerative period is defined as the minimum
interval in which each $Q$ state is visited at least once. Let $l_k(i)$ and
$V_k$ be the number of times that $Q^i$ is visited and the estimated
performance potential in the $k$-th regenerative period respectively.
Let $n_0 = 0$, $n_{k+1} = \min\{t + 1 : t > n_k, \min_i l_k(i) = 1\}$
for $k \ge 0$. Then the sample path in the $k$-th regenerative
period is $\{Q(n_k), \cdots, Q(n_{k+1} - 1)\}$. At the beginning of
the $k$-th regenerative period ($n_k \le t \le n_{k+1} - 1$), initialize
the following dummy variables as $S_g^k(i) = 0$, $S_V^k(i) = 0$
and $l_k(i) = 0$. Within the $k$-th regenerative period, we adopt
policy $\Omega_k$. After observing the queue state $Q(t+1)$ at the
end of the $t$-th slot, update the following metrics of the visited
queue state $Q(t)$ according to

$$\left. \begin{aligned} S_g^k(i) &= S_g^k(i) + g(\chi(t), \Omega_k(\chi(t))) \\ S_V^k(i) &= S_V^k(i) + \widetilde{V}_k(Q(t+1)) \\ l_k(i) &= l_k(i) + 1 \end{aligned} \right\} \quad \text{if } Q^i = Q(t) \tag{13}$$

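At the end of the $k$-th regenerative period, using the stochastic
approximation algorithm [17], we update the estimated potential
for the $(k+1)$-th regenerative period, which is

$$\widetilde{V}_{k+1}(Q^i) = \widetilde{V}_k(Q^i) + \epsilon_k Y_k(Q^i), \quad \forall i \tag{14}$$

where

$$Y_k(Q^i) = \left[ \frac{S_g^k(i)}{l_k(i)} + \frac{S_V^k(i)}{l_k(i)} \right] - \left[ \frac{S_g^k(I)}{l_k(I)} + \frac{S_V^k(I)}{l_k(I)} - \widetilde{V}_k(Q^I) \right] - \widetilde{V}_k(Q^i)$$

and $\epsilon_k$ is the step size of the stochastic approximation algorithm
and $Q^I$ is the reference state^7. Accordingly, we update
the policy for the $(k+1)$-th regenerative period, which is given
by

$$\Omega_{k+1} = \arg\min T(V_{k+1}) \tag{15}$$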

Therefore, we could construct an online value iteration algo- 
rithm as below. 



Algorithm 1: Online Value Iteration

• Step 1 (Initialization): Set $t = 0$, and start the system at an
initial state $Q(0)$. Set $k = 0$, initialize the potential $V_0$ and
policy $\Omega_0 = \arg\min T(V_0)$ in the 0-th regenerative period.

• Step 2 (Online Potential Estimation): At the beginning of
the $k$-th regenerative period, set $S_g^k(i) = 0$, $S_V^k(i) = 0$ and
$l_k(i) = 0 \ \forall i$. Run the system with policy $\Omega_k$ to $n_{k+1} - 1$ and
accumulate the information of the visited $Q$ from slot to slot
according to (13). At $n_{k+1} - 1$, update the estimated potential
$V_{k+1}$ for the $(k+1)$-th regenerative period according to (14).

• Step 3 (Online Policy Improvement): Update the policy
$\Omega_{k+1}$ for the $(k+1)$-th regenerative period according to (15).

• Step 4 (Termination): If $\|V_{k+1} - V_k\| < \delta_V$, stop; otherwise,
set $k := k + 1$ and go to Step 2.
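The potential update at the end of a regenerative period (Step 2) can be sketched as follows. The sketch assumes the update has the relative-value form implied by the fixed point in (18), i.e. the sample average of per-stage cost plus next-state potential is compared against the same quantity at the reference state; the function name `update_potential` and the list-based layout are hypothetical.

```python
def update_potential(V, S_g, S_V, visits, ref, eps):
    """One stochastic-approximation update of the potential vector.

    V      : current potential estimates, one entry per queue state
    S_g    : accumulated per-stage costs observed when leaving each state
    S_V    : accumulated potentials of the observed next states
    visits : number of visits l_k(i) to each state in the period (all >= 1)
    ref    : index of the reference state
    eps    : step size epsilon_k
    """
    def sample_T(i):
        # Sample-path estimate of the Bellman operator at state i.
        return (S_g[i] + S_V[i]) / visits[i]

    offset = sample_T(ref) - V[ref]   # estimate of the average cost
    return [V[i] + eps * (sample_T(i) - offset - V[i]) for i in range(len(V))]


# Illustrative numbers only: two states, state 0 is the reference.
V_next = update_potential([0.0, 1.0], [2.0, 6.0], [1.0, 0.0], [1, 2], 0, 0.5)
```

With this form the potential of the reference state never changes, which is one way of pinning down the additive constant of the Bellman equation.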



Fig. 2 illustrates the online value iteration algorithm with
an example.

^7 Without loss of generality, we set the state in which all $K$ buffers are empty
as $Q^I$ and initialize $V_0(Q^I) = 0$.

Fig. 2. Illustration of online sample path based potential learning algorithm.
$K = 2$, $N_Q = 1$, $I = (N_Q + 1)^2 = 4$, the four joint queue states are 00,
01, 10, 11, which are denoted as 0, 1, 2, 3 for simplicity. The sample path
in the $k$-th regenerative period ($n_k \le t \le n_{k+1} - 1$) is {3, 1, 2, 1, 0, 1}. At
the beginning of the $n_k$-th slot, set $S_g^k(i) = 0$, $S_V^k(i) = 0$ and $l_k(i) = 0$.
In the $k$-th regenerative period, adopt policy $\Omega_k$ and update $S_g^k(i)$, $S_V^k(i)$
and $l_k(i)$ according to (13) from slot to slot. At the end of the $(n_{k+1} - 1)$-th
slot, update the potential $V_{k+1}(Q^i)$ and policy $\Omega_{k+1}$ for the $(k+1)$-th
regenerative period according to (14) and (15).



Remark 1 ( Comparison to the deterministic NUM): In 
conventional iterative solutions for deterministic NUM [15], 
the iterative updates (with message exchange^8) are performed
within the CSI coherence time and hence, this limits the 
number of iterations and the performance. However, in the 
proposed online algorithm, the updates in the iteration steps 
evolve in the same time scale as the CSI and QSI. Hence, 
the algorithm could converge to a better solution because the 
number of iterations is no longer limited by the coherence 
time of CSI. 

B. Convergence Analysis 

In this part, we shall establish the almost sure convergence of
the online value iteration in (14).
Assume the sequence of step sizes $\{\epsilon_k\}$ is chosen such that it
satisfies the following stepsize conditions:

$$\sum_k \epsilon_k = \infty, \quad \epsilon_k \ge 0, \quad \sum_k \epsilon_k^2 < \infty \tag{16}$$
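A standard example of a sequence satisfying the stepsize conditions in (16) is the harmonic schedule $\epsilon_k = c/(k+1)$: its sum diverges while the sum of its squares stays bounded. The sketch below checks this numerically over a finite horizon; the schedule and the function name `step_size` are illustrative choices, not something prescribed by the paper.

```python
def step_size(k, c=1.0):
    """Harmonic step size epsilon_k = c / (k + 1) for k = 0, 1, 2, ..."""
    return c / (k + 1)


horizon = 10000
partial_sum = sum(step_size(k) for k in range(horizon))            # grows like log(horizon)
partial_sum_sq = sum(step_size(k) ** 2 for k in range(horizon))    # bounded by pi^2 / 6
```

A constant step size would violate the square-summability condition, while a geometrically decaying one would violate the divergence condition, so schedules between these two extremes are the usual choice.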

Let $E_k$ and $\Pr_k$ denote the expectation and probability conditioned
on the $\sigma$-algebra $\mathcal{F}_k$, generated by $\{V_0, Y_i, i < k\}$.
Define $\delta M_k(Q^i) = Y_k(Q^i) - E_k[Y_k(Q^i)]$ and

$$A_{k-1} = P_{\Omega_k} \epsilon_{k-1} + (1 - \epsilon_{k-1}) \mathbf{I}$$
$$B_{k-1} = P_{\Omega_{k-1}} \epsilon_{k-1} + (1 - \epsilon_{k-1}) \mathbf{I}$$

where $P_{\Omega_k}$ and $\mathbf{I}$ are the $I \times I$ transition matrix under policy
$\Omega_k$ and the identity matrix. The convergence of the online value
iteration algorithm is summarized in the following theorem.

^8 Since the iterations within a CSI coherence time involve explicit message
passing, there is processing and signaling overhead per iteration and this limits 
the total number of iterations within a CSI coherence time. 



Theorem 1 (Convergence of Online Value Iteration):
Assume $\delta M_k$ is bounded $\forall k$ and the sequence of step
sizes $\{\epsilon_k\}$ satisfies the conditions in (16). Assume that for
every set of control policies $\Omega_0, \cdots, \Omega_m$, there exist a
$\delta_m = O(\epsilon^m) > 0$ and some positive integer $m$ such that

$$[A_m \cdots A_1]_{iI} \ge \delta_m, \quad [B_m \cdots B_1]_{iI} \ge \delta_m, \quad 1 \le i \le I \tag{17}$$

where $[\cdot]_{iI}$ denotes the element of the $i$-th row and $I$-th column
of the corresponding $I \times I$ matrix and $e = [1, \cdots, 1]^T$ is the
$I \times 1$ unit vector. For an arbitrary initial potential vector $V_0$,
we have $\lim_{k \to \infty} V_k = V_\infty$ w.p.1, where the steady state
potential $V_\infty$ satisfies:

$$\big( T_I(V_\infty) - \widetilde{V}_\infty(Q^I) \big) e + V_\infty = T(V_\infty) \tag{18}$$

Furthermore, the optimal reward of the delay optimal problem
is $J_\beta^* = T_I(V_\infty) - \widetilde{V}_\infty(Q^I)$ and the optimal stationary
policy is $\Omega^* = \arg\min T(V_\infty)$.

Proof: Please refer to Appendix B. ■ 

Remark 2: Note that $A_k$ and $B_k$ are related to an equivalent
transition matrix of the underlying Markov chain. (17)
simply means that state $Q^I$ is accessible from all the $Q$ states
after some finite number of transition steps. This is a very
mild condition and will be satisfied in most of the cases we
are interested in.

Remark 3: Note that (18) is equivalent to the reduced state
Bellman equation in (11). As a result, the converged solution
of (18) corresponds to the solution of (11).

V. Application to OFDMA Systems with Poisson 
Arrival 

In this section, we shall illustrate the usage of the online
value iteration to minimize the average weighted delay of
OFDMA systems under Poisson packet arrivals and exponentially
distributed packet sizes.

Assumption 3: The arrival process $A_k(t)$ is i.i.d. over
scheduling slots following a Poisson distribution with average
arrival rate $E[A_k] = \lambda_k$. The random packet size $N_k(t)$ is i.i.d.
over scheduling slots following an exponential distribution
with mean packet size $\bar{N}_k$.

Assumption 4: The slot duration $\tau$ is sufficiently small
compared with the average packet interarrival time as well
as the conditional average packet service time^9, i.e. $\lambda_k \tau \ll 1$ and
$\mu_k(\chi) \tau \ll 1$.

Conditioned on the system state $\chi$, the conditional mean departure
rate of user $k$ is given by $\mu_k(\chi) = E[R_k(\chi) / N_k | \chi] =
R_k(\chi) / \bar{N}_k$. Thus, the conditional probability (conditioned on
the current system state realization $\chi(t)$) of a packet departure
event at the $t$-th slot is given by

$$\Pr\left[ \frac{N_k(t)}{R_k(t)} < \tau \,\Big|\, \chi(t), \Omega(\chi(t)) \right] = \Pr\left[ \frac{N_k(t)}{\bar{N}_k} < \mu_k(\chi(t)) \tau \right] = 1 - \exp(-\mu_k(\chi(t)) \tau) \approx \mu_k(\chi(t)) \tau$$

where the approximation is due to Assumption 4. Note that under
Assumption 4 the probabilities of simultaneous arrival or departure
of two or more packets from the same queue or different
queues, and of simultaneous arrival as well as departure in a
slot, are $O((\lambda_k \tau)^2)$, $O((\mu_k(\chi(t)) \tau)^2)$ and $O(\lambda_k \tau \mu_k(\chi(t)) \tau)$
respectively, which are asymptotically negligible. Therefore,
the vector queue dynamics are given by

$$\begin{aligned} \Pr[(Q_1^i, \cdots, (Q_k^i + 1) \wedge N_Q, \cdots, Q_K^i) | Q^i, u(Q^i)] &\approx \lambda_k \tau, \quad \forall k \\ \Pr[(Q_1^i, \cdots, (Q_k^i - 1)^+, \cdots, Q_K^i) | Q^i, u(Q^i)] &\approx \bar{\mu}_k(Q^i) \tau, \quad \forall k \\ \Pr[Q^i | Q^i, u(Q^i)] &\approx 1 - \sum_k \big( \lambda_k \tau + \bar{\mu}_k(Q^i) \tau \big) \end{aligned} \tag{19}$$

where $x \wedge N_Q = \min\{x, N_Q\}$, $Q^i = (Q_1^i, \cdots, Q_K^i)$, and
$\bar{\mu}_k(Q) = E[\mu_k(\chi) | Q]$. In what follows, we shall discuss the
optimal solution and an asymptotically optimal solution with only
linear complexity and memory requirement for the OFDMA
system with the conditional transition kernel given by (19).

^9 This is a mild assumption which could be justified in many applications.
For example, in WiMAX, a frame duration is around 2 ms while the target
queueing delay for applications (such as video streaming) is around 200 ms.
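The birth-death transition kernel in (19) can be written out directly. The sketch below returns the (approximate) next-state distribution from a given joint queue state; the function name `birth_death_kernel`, the dictionary output and the way boundary states are handled (probability mass that would leave the buffer range is kept at the current state) are assumptions made for the illustration.

```python
def birth_death_kernel(q, arrival_rates, service_rates, tau, buffer_size):
    """Approximate one-slot transition probabilities from joint queue state q.

    q             : tuple of per-user backlogs (Q_1, ..., Q_K)
    arrival_rates : lambda_k in packets/second
    service_rates : conditional mean service rates mu_bar_k(q) in packets/second
    tau           : slot duration in seconds (lambda_k*tau and mu_bar_k*tau << 1)
    buffer_size   : maximum buffer size N_Q
    Returns a dict mapping next joint state -> probability.
    """
    q = tuple(q)
    probs = {}
    stay = 1.0
    for k, (lam, mu) in enumerate(zip(arrival_rates, service_rates)):
        up = q[:k] + (min(q[k] + 1, buffer_size),) + q[k + 1:]
        down = q[:k] + (max(q[k] - 1, 0),) + q[k + 1:]
        probs[up] = probs.get(up, 0.0) + lam * tau     # single arrival to queue k
        probs[down] = probs.get(down, 0.0) + mu * tau  # single departure from queue k
        stay -= (lam + mu) * tau
    probs[q] = probs.get(q, 0.0) + stay                # no event in this slot
    return probs


# Two users with buffer size 1 (illustrative numbers only).
kernel = birth_death_kernel((0, 1), [1.0, 2.0], [3.0, 4.0], 0.01, 1)
```

From any joint state only about $2K$ neighbouring states are reachable in one slot, which is the structure the decomposed solution exploits.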



A. Optimal Solution

Given (19), the optimization objective in the R.H.S. of the
Bellman equation in (11) becomes

$$E\left[ \gamma \sum_{k,n} p_{k,n} - \sum_{k=1}^{K} \Delta_k \widetilde{V}(Q^i) \frac{\tau}{\bar{N}_k} \Big( \sum_{n} s_{k,n} \log(1 + p_{k,n} |H_{k,n}|^2) \Big) \,\Big|\, Q^i \right] \tag{20}$$

where $\Delta_k \widetilde{V}(Q^i) = \widetilde{V}(Q_1^i, \cdots, Q_K^i) - \widetilde{V}(Q_1^i, \cdots, [Q_k^i - 1]^+, \cdots, Q_K^i)$.
Using standard optimization techniques, the
closed-form solution during the policy improvement step (15)
in the online value iteration algorithm is summarized below.

Lemma 2 (Closed-Form Power Control and Subcarrier Allocation
of Online Policy Improvement): Under the above setup,
the optimizing power control and subcarrier allocation actions
in the policy improvement step (15) for given CSI $H$ and QSI
$Q$ are given by:

$$p_{k,n}(H, Q) = s_{k,n}(H, Q) \left( \frac{\frac{\tau}{\bar{N}_k} \Delta_k \widetilde{V}(Q)}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^+ \tag{21}$$

$$s_{k,n}(H, Q) = \begin{cases} 1, & \text{if } X_{k,n} = \max_j \{X_{j,n}\} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

where $\gamma$ satisfies (3) and

$$X_{k,n} = \frac{\tau}{\bar{N}_k} \Delta_k \widetilde{V}(Q) \log\left( 1 + |H_{k,n}|^2 \left( \frac{\frac{\tau}{\bar{N}_k} \Delta_k \widetilde{V}(Q)}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^+ \right) - \gamma \left( \frac{\frac{\tau}{\bar{N}_k} \Delta_k \widetilde{V}(Q)}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^+$$

Proof: Due to the combinatorial nature of the subcarrier
allocation in the OFDMA system, i.e. $s_{k,n} \in \{0, 1\}$ and
(4), finding the optimal solution by brute force exhaustive
search requires exponential complexity [20]. By applying the
continuous relaxation technique [20], [21], the original integer
programming problem can be relaxed to a convex optimization
problem and then standard convex optimization techniques
can be applied. It can be shown that the original problem and
the relaxed problem share the same optimal solution in general.
We omit the detailed proof here due to the page limit. ■

Remark 4 (Structure of the Optimal Power Control and
Subcarrier Allocation): The optimal power control and the
subcarrier allocation solution in Lemma 2 are both functions of
the CSI and QSI, where they depend on the QSI indirectly via the
potential function $\{\widetilde{V}(Q)\}$. The power control solution
has the form of multi-level water-filling, where the power
is allocated according to the CSI across subcarriers but the
water-level is adaptive to the QSI. Similarly, the subcarrier is
selected according to the metric $X_{k,n}$, which depends on both
the CSI and the QSI.

While the online value iteration is much simplified compared
with the brute-force MDP solutions, it still suffers from
an exponentially large memory requirement and computational
complexity for storing and computing the potential vector $V$.
In the next subsection, we shall exploit the birth-death dynamics
and derive an asymptotically optimal solution with
linear $O(KN_F)$ computational complexity and $O(K)$ memory
requirement.


B. Asymptotically Optimal Solution 

We first define a simplified subcarrier allocation policy
below and summarize an important structural result of the
Bellman equation in (11) under the simplified policy.

Definition 3 (CSI-Only Subcarrier Allocation Policy): A
CSI-only subcarrier allocation policy is defined as $\Omega_s(H) =
\{s_{k,n}(H) \in \{0, 1\} \mid \sum_{k=1}^{K} s_{k,n} = 1 \ \forall n\}$.

Theorem 2 (Additive Property of the Potential Function):
Under the average power constraint in (3) and the CSI-only
subcarrier allocation policy, the solution of the
Bellman equation in (11) can be expressed in the form
$\widetilde{V}(Q) = \sum_k \widetilde{V}_k(Q_k)$ and $\theta = \sum_k \theta_k$, where $\{\widetilde{V}_k(Q_k), \theta_k\}$ is
the solution of the $k$-th user's reduced Bellman equation:

$$\theta_k = \min_{u_k(Q_k)} \Big\{ g_k(Q_k, u_k(Q_k)) + \lambda_k \tau \Delta \widetilde{V}_k(Q_k + 1) - \bar{\mu}_k(Q_k) \tau \Delta \widetilde{V}_k(Q_k) \Big\} \tag{23}$$

where $u_k(Q_k) = \{(p(H, Q_k), s(H)) : \chi = (H, Q_k) \ \forall H\}$ is the
set of the conditional control actions, $\bar{\mu}_k(Q_k) = E\big[\sum_{n=1}^{N_F}
s_{k,n}(H) \log(1 + p_{k,n}(H, Q_k) |H_{k,n}|^2) \,\big|\, Q_k\big] / \bar{N}_k$ is the
conditional average service rate and $\Delta \widetilde{V}_k(Q_k) = \widetilde{V}_k(Q_k) -
\widetilde{V}_k([Q_k - 1]^+)$ is the potential increment of the $k$-th queue.

Proof: Please refer to Appendix C for the proof. ■
As a result of Theorem 2, if we restrict the subcarrier
allocation policy to the CSI-only subcarrier allocation policy,
then the potential function $\{\widetilde{V}(Q)\}$ of the joint QSI for the
original MDP problem can be decomposed into $K$ individual
potential functions $\{\widetilde{V}_k(Q_k)\}$ of the individual QSI for the
$K$ individual MDPs. This could substantially simplify the
potential estimation and improve the convergence speed as well as the
memory requirement. In particular, to satisfy the condition
of Theorem 2, we modify the optimal subcarrier allocation
solution in Lemma 2 to a CSI-only policy which is given by

$$s_{k,n}(H) = \begin{cases} 1, & \text{if } |H_{k,n}|^2 = \max_j \{|H_{j,n}|^2\} \\ 0, & \text{otherwise} \end{cases} \tag{24}$$

While using (24) will result in some loss of optimality in
the strict sense, we shall show in the following Corollary that the
solution is indeed asymptotically optimal.
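The CSI-only rule in (24) simply gives each subcarrier to the user with the strongest channel on it. A minimal sketch follows; the function name `csi_only_allocation` is hypothetical, and ties are broken in favour of the lowest user index, a detail the text does not specify.

```python
def csi_only_allocation(H_gain):
    """Assign each subcarrier to the user with the largest |H_{k,n}|^2.

    H_gain : H_gain[k][n] = |H_{k,n}|^2
    Returns s with s[k][n] = 1 if subcarrier n is given to user k, else 0.
    """
    K, N = len(H_gain), len(H_gain[0])
    s = [[0] * N for _ in range(K)]
    for n in range(N):
        best = max(range(K), key=lambda k: H_gain[k][n])  # first maximizer on ties
        s[best][n] = 1
    return s


# Two users, two subcarriers (illustrative numbers only).
allocation = csi_only_allocation([[1.0, 0.2], [0.5, 3.0]])
```

Because this rule ignores the queue states, the only coupling left between users is through the channel, which is what allows the per-user potentials to be learned separately.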
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Corollary 1 (Reduced Complexity Online Value Iteration): 
The following reduced complexity online value iteration 
algorithm has 0{KNp) complexity and 0{K) memory 
requirement. It converges almost surely to a solution which 
is asymptotically optimal for sufficiently large /if*"! 



VI. Simulation Results and Discussions 

In this section, we compare the proposed optimal and reduced complexity solutions, obtained by online value iteration via stochastic approximation, for the delay-optimal problem in a system with Poisson arrivals and exponentially distributed packet sizes against three reference baselines. Baseline 1 is a throughput-optimal policy,^{11} namely the Modified Largest Weighted Delay First (M-LWDF) [22]. Baseline 2 is the Real Time Stochastic Primal Dual (RT-SPD) algorithm [12]. Baseline 3 is Round Robin scheduling, in which different users

^{10} As we scale up $K$, we assume the total transmit power $P_0$ is sufficiently large so that $(\lambda_1, \cdots, \lambda_K)$ still remains in the stability region of the system.

^{11} A throughput-optimal policy is one that stabilizes the queues whenever the arrival rate vector falls within the stability region.




[Figure 3: plot of average delay per queue versus SNR (dB); recovered legend entry: Proposed (Centralized).]

Fig. 3. Average delay per queue versus SNR. The number of users $K = 2$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 305.2$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_1 = \beta_2 = 1$. The packet drop rate of the proposed schemes is 5%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 5%, 5%, 6% respectively.



[Figure 4: plot of average weighted delay versus SNR (dB).]

Fig. 4. Average weighted delay versus SNR. The number of users $K = 2$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 305.2$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_1 = 1$, $\beta_2 = 4$. The packet drop rate of the proposed schemes is 4%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 4%, 6%, 6% respectively.



are served in TDMA fashion with equally allocated time slots and water-filling power allocation across the subcarriers. In the simulation, we assume 64 subcarriers with a total bandwidth of 10 MHz, and the number of independent subbands $N_F$ is 4. The scheduling slot duration $\tau$ is 5 ms. The buffer size $N_Q$ is 10. The average packet arrival rate of the Poisson process $\lambda_k$ is 20 packets/s.
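The Round Robin baseline relies on water-filling across subcarriers. The sketch below shows the classical single-user water-filling rule $p_n = (\nu - 1/g_n)^+$ with the water level $\nu$ chosen to meet a sum power budget. It is a generic illustration under stated assumptions (a per-slot sum power budget `total_power`, unit noise power, strictly positive gains, and the hypothetical function name `water_filling`), not the exact baseline configuration used in the simulations.

```python
def water_filling(gains, total_power):
    """Classical water-filling over parallel channels.

    gains       -- channel gains g_n = |H_n|^2 (assumed > 0, unit noise power)
    total_power -- sum power budget (assumed > 0)
    Returns the power per channel, p_n = max(level - 1/g_n, 0).
    """
    inverse_gains = sorted(1.0 / g for g in gains)
    level = 0.0
    # Try to fill the m best channels, starting from all of them.
    for m in range(len(inverse_gains), 0, -1):
        level = (total_power + sum(inverse_gains[:m])) / m
        if level > inverse_gains[m - 1]:     # the weakest active channel gets positive power
            break
    return [max(level - 1.0 / g, 0.0) for g in gains]


# Hypothetical gains: the two weakest channels end up below the water level.
powers = water_filling([1.0, 0.5, 0.1], total_power=1.0)
```

With the stated parameters, the per-slot arrival probability of each queue is $\lambda_k \tau = 20 \times 0.005 = 0.1$, which is the regime in which at most one arrival per slot is a reasonable approximation.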

Fig. 3 illustrates the average delay per queue versus SNR for 2 users with equal queue weights. Both the optimal solution and the reduced complexity solution show a significant gain over the three baselines (e.g. more than 5 dB gain when the average delay per queue is less than 9 packets). In addition, the delay performance of the reduced complexity solution, which is asymptotically optimal for a large number of users, is very close to the performance of the optimal solution



Algorithm 2: Reduced Complexity Online Value Iteration
• Step 1 (Initialization): For each user $k = 1 : K$, initialize the potential $\mathbf{V}_k^0$ and policy $\Omega_k^0 = \arg\min T(\mathbf{V}_k^0)$ in the 0-th regenerative period. Set $t = 0$, and start the OFDMA system at an initial QSI state $Q_k(0)$ for each user.

• Step 2 (Online Potential Estimation): For each user $k = 1 : K$ in its $l_k$-th regenerative period, set $S_g^{l_k}(Q_k) = 0$,

Syik (Qk) = and j^iQk) — OVQk- Perform power control 

and subcarrier allocation according to the policy $\Omega_k^{l_k}$, and accumulate the information of the visited $Q_k$ for each user $k$ if $Q_k = Q_k(t)$ according to

$$S_g^{l_k}(Q_k) = S_g^{l_k}(Q_k) + g_k(\chi_k(t), \Omega_k^{l_k}(\chi_k(t)))$$
$$S_V^{l_k}(Q_k) = S_V^{l_k}(Q_k) + V_k^{l_k}(Q_k(t + 1))$$

If the current slot corresponds to the end of any of the $K$ regenerative periods, update the corresponding estimated potential $\mathbf{V}_k^{l_k}$ for its next regenerative period according to

{0<Qk< Nq) 

>ik+i 



Vt^\Qk) = V^''iQk) + ei, 



S~ikiQk) S^i.k{Qk) 



s.kiQi) s^.kiQl) ^, 

• Step 3 (Online Policy Improvement): For each user $k = 1 : K$, if the current slot corresponds to the end of any of the $K$ regenerative periods, then update the policy for its next regenerative period at the BS according to (24) and

PkA'il.Q>^) = h.niVL)(^ YT^) (25) 

• Step 4 (Termination): If $\|\mathbf{V}_k^{l_k+1} - \mathbf{V}_k^{l_k}\| < \delta_v\ \forall k$, stop; otherwise, set $l_k := l_k + 1$ and go to Step 2.



Proof: Please refer to Appendix D for the proof. 
[Figure 5: plot of average delay per queue versus number of users.]

Fig. 5. Average delay per queue versus the number of users. The buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 39$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_k = 1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 4%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 4%, 4%, 6% respectively.



[Figure 6: plot of the potential function versus number of iterations (time duration in a communication session), with iteration ticks from 50 to 350. In-plot annotations: average delay of the proposed scheme 3.836 and 3.716 packets at two marked iterations. Average delay (packets): Proposed (Reduced Complexity) 3.716, Baseline 1 (M-LWDF) 7.832, Baseline 2 (RT-SPD) 5.875, Baseline 3 (Round Robin) 8.634.]

Fig. 6. Illustration of convergence property. Potential function versus the number of iterations. The number of users $K = 8$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 39$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_k = 1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 1%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 1%, 1%, 4% respectively.



even in the 2-user case. Fig. 4 depicts the average weighted delay versus SNR for 2 heterogeneous users with different queue weights. The average weighted delay of the reduced complexity solution is also close to that of the optimal solution. Therefore, the proposed reduced complexity solution, with linear $O(KN_F)$ computational complexity, $O(K)$ memory requirement and near-optimal performance, is of great practical significance.

Fig. 5 illustrates the average delay per queue of the reduced complexity solution versus the number of users with equal queue weights at a transmit SNR = 10 dB. The reduced complexity solution achieves a substantial delay gain over the three baselines over the whole range of user numbers considered.

Fig. 6 illustrates the convergence property of the proposed reduced complexity algorithm. We plot the average potential function of 8 users versus the number of iterations at a transmit SNR = 10 dB. The reduced complexity algorithm converges quickly: the average delay corresponding to the potential function at the 50-th iteration is 3.8 packets, which is much smaller than that of the baselines (about 7.8, 5.9 and 8.6 packets for Baselines 1, 2 and 3 according to the annotations in Fig. 6).

VII. Summary 

In this paper, we propose a low-complexity solution to the delay-optimal power and subcarrier allocation design for OFDMA systems. We model the problem as a $K$-dimensional infinite horizon average reward MDP with the control actions based on CSI and joint QSI. We derive the equivalent reduced state Bellman equation and propose an online stochastic value iteration solution using stochastic approximation. We prove that under some mild conditions,^{12} the proposed solution converges to the optimal solution almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics under Poisson arrivals, we obtain a reduced complexity decomposed solution with linear $O(KN_F)$ complexity and $O(K)$ memory requirement.

Appendix 
Appendix A: Proof of Lemma 1



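$$
\begin{aligned}
&\theta + V(\mathbf{H}, \mathbf{Q}^i) \quad \forall \mathbf{H},\ 1 \le i \le I \\
&= \min_{\mathbf{u}(\mathbf{H},\mathbf{Q}^i)} \Big[ g\big((\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big) + \sum_{(\mathbf{H}',\mathbf{Q}^j)} \Pr\big[(\mathbf{H}',\mathbf{Q}^j) \,\big|\, (\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big]\, V(\mathbf{H}',\mathbf{Q}^j) \Big] \\
&\overset{(a)}{=} \min_{\mathbf{u}(\mathbf{H},\mathbf{Q}^i)} \Big[ g\big((\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, (\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big] \Big( \sum_{\mathbf{H}'} \Pr[\mathbf{H}']\, V(\mathbf{H}',\mathbf{Q}^j) \Big) \Big] \\
&\overset{(b)}{=} \min_{\mathbf{u}(\mathbf{H},\mathbf{Q}^i)} \Big[ g\big((\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, (\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big]\, V(\mathbf{Q}^j) \Big] \\
\overset{(c)}{\Rightarrow}\ &\theta + V(\mathbf{Q}^i) \quad \forall 1 \le i \le I \\
&= \mathbb{E}\Big[ \min_{\mathbf{u}(\mathbf{H},\mathbf{Q}^i)} \Big[ g\big((\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, (\mathbf{H},\mathbf{Q}^i), \mathbf{u}(\mathbf{H},\mathbf{Q}^i)\big]\, V(\mathbf{Q}^j) \Big] \,\Big|\, \mathbf{Q}^i \Big] \\
&\overset{(d)}{=} \min_{\mathbf{u}(\mathbf{Q}^i)} \Big[ g\big(\mathbf{Q}^i, \mathbf{u}(\mathbf{Q}^i)\big) + \sum_{\mathbf{Q}^j} f\big(\mathbf{Q}^j \,\big|\, \mathbf{Q}^i, \mathbf{u}(\mathbf{Q}^i)\big)\, V(\mathbf{Q}^j) \Big] \quad (26)
\end{aligned}
$$

where (a) is due to the factorization $\Pr[(\mathbf{H}',\mathbf{Q}^j)|(\mathbf{H},\mathbf{Q}^i),\mathbf{u}(\mathbf{H},\mathbf{Q}^i)] = \Pr[\mathbf{H}'] \Pr[\mathbf{Q}^j|(\mathbf{H},\mathbf{Q}^i),\mathbf{u}(\mathbf{H},\mathbf{Q}^i)]$ of the transition kernel, (b) is due to the definition $V(\mathbf{Q}) = \mathbb{E}[V(\boldsymbol{\chi})|\mathbf{Q}]$, (c) is obtained by taking the conditional expectation (conditioned on $\mathbf{Q}^i$) on both sides of (10) and (d) is due to the definition of "conditional actions"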

^{12} The mild conditions refer to the two conditions in Theorem 1. One is the stepsize condition. The other is the condition on accessibility of the Markov chain, which is easily satisfied in most cases of interest.
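and $g(\mathbf{Q}, \mathbf{u}(\mathbf{Q})) = \mathbb{E}[g(\boldsymbol{\chi}, \mathbf{u}(\boldsymbol{\chi}))|\mathbf{Q}]$, $f(\mathbf{Q}^j|\mathbf{Q}^i, \mathbf{u}(\mathbf{Q}^i)) = \mathbb{E}\big[\Pr[\mathbf{Q}^j|(\mathbf{H},\mathbf{Q}^i),\mathbf{u}(\mathbf{H},\mathbf{Q}^i)] \,\big|\, \mathbf{Q}^i\big]$.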



Appendix B: Proof of Theorem 1

We shall first show the convergence of the martingale noise 
in dull. Let 

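$$q_k(\mathbf{Q}^i) = \mathbb{E}_k[Y_k(\mathbf{Q}^i)] = T_i(\mathbf{V}_k) - V_k(\mathbf{Q}^i) - \left(T_I(\mathbf{V}_k) - V_k(\mathbf{Q}^I)\right) \quad (27)$$

$Y_k(\mathbf{Q}^i)$ is the noise-corrupted observation of $q_k(\mathbf{Q}^i)$. The martingale difference noise is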



^9ki^) ~,r^i^\ 



+ ( 



lk{i) 



mi) - (■ 



ik{i) 



h+iii) 
lk+i{I) 



^/(Q^|Q^u(Q^))F,(Q^")) 



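with the property that $\mathbb{E}_k[\delta M_k(\mathbf{Q}^i)] = 0$ and $\mathbb{E}[\delta M_k(\mathbf{Q}^i)\, \delta M_{k'}(\mathbf{Q}^i)] = 0$, $\forall k \ne k'$. For some $j$, define $M_k(\mathbf{Q}^i) = \sum_{l=j}^{k} \epsilon_l\, \delta M_l(\mathbf{Q}^i)$. Thus, we have

$$V_{k+1}(\mathbf{Q}^i) = V_k(\mathbf{Q}^i) + \epsilon_k \left[ q_k(\mathbf{Q}^i) + \delta M_k(\mathbf{Q}^i) \right] = V_j(\mathbf{Q}^i) + \sum_{l=j}^{k} \epsilon_l\, q_l(\mathbf{Q}^i) + M_k(\mathbf{Q}^i) \quad (28)$$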



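Since $\mathbb{E}_k[M_k(\mathbf{Q}^i)] = M_{k-1}(\mathbf{Q}^i)$, $M_k(\mathbf{Q}^i)$ is a martingale sequence. By the martingale inequality, we have

$$\Pr\nolimits_j \Big\{ \sup_{j \le l \le k} |M_l(\mathbf{Q}^i)| \ge \lambda \Big\} \le \frac{\mathbb{E}_j\big[|M_k(\mathbf{Q}^i)|^2\big]}{\lambda^2}.$$

By the boundedness assumption and the property of the martingale difference noise as well as the condition on the stepsize sequence, we have

$$\mathbb{E}_j\big[|M_k(\mathbf{Q}^i)|^2\big] = \mathbb{E}_j\Big[ \sum_{l=j}^{k} \epsilon_l^2\, \delta M_l^2(\mathbf{Q}^i) \Big] \le \bar{M} \sum_{l=j}^{k} \epsilon_l^2 \ \Rightarrow\ \lim_{j \to \infty} \Pr\nolimits_j \Big\{ \sup_{j \le l \le k} |M_l(\mathbf{Q}^i)| \ge \lambda \Big\} = 0.$$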
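Thus, as $j \to \infty$, (28) goes to $V_{k+1}(\mathbf{Q}^i) = V_j(\mathbf{Q}^i) + \sum_{l=j}^{k} \epsilon_l\, q_l(\mathbf{Q}^i)$ with probability 1, the vector form of which is given by

$$\mathbf{V}_{k+1} = \mathbf{V}_j + \sum_{l=j}^{k} \epsilon_l\, \mathbf{q}_l = \mathbf{V}_j + \sum_{l=j}^{k} \epsilon_l \left[ \mathbf{T}(\mathbf{V}_l) - \mathbf{V}_l - \left( T_I(\mathbf{V}_l) - V_l(I) \right) \mathbf{e} \right] \quad (29)$$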
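The role of the stepsize condition in the argument above can be checked numerically. The sketch below assumes the common choice $\epsilon_l = 1/l$ (an assumption for illustration; the proof only needs the stepsize condition of Theorem 1): the tail sum $\sum_{l \ge j} \epsilon_l^2$, which bounds the variance of the accumulated martingale noise, vanishes as $j$ grows, while $\sum_l \epsilon_l$ keeps growing so the iteration can still move arbitrarily far. The function names are hypothetical.

```python
def stepsize_tail_energy(j, k):
    """Sum of eps_l^2 for l = j..k with the assumed step size eps_l = 1/l."""
    return sum(1.0 / (l * l) for l in range(j, k + 1))


def stepsize_mass(j, k):
    """Sum of eps_l for l = j..k with the assumed step size eps_l = 1/l."""
    return sum(1.0 / l for l in range(j, k + 1))


horizon = 100000
# The noise-variance bound shrinks as the starting index j increases ...
tails = [stepsize_tail_energy(j, horizon) for j in (10, 100, 1000)]
# ... while the total step mass keeps growing with the horizon.
mass = stepsize_mass(1, horizon)
```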



Next, we shall show the convergence of (29) after the martingale noise is averaged out. In the following proof, we use $i$ instead of $\mathbf{Q}^i$ for simplicity. Let $\Omega_k(i)$ denote the optimal control action attaining the minimum in $T_i(\mathbf{V}_k)$. Let $\mathbf{g}_{\Omega_k}$ and $\mathbf{P}_{\Omega_k}$ denote the conditional per-stage reward vector and conditional average transition probability matrix under the optimal control policy $\Omega_k$. Denote $w_k = T_I(\mathbf{V}_k) - V_k(I)$. We have



Qfe-l = gOfc_i + PUk-i'^k-l — 

=>Afe_iqfe_i - {wk - Wk-i)e < 
<Bfe_iqfe_i - {wk - Wfc-i)e, Vfc > 1 



Wk-ie 



by iterating 



'Afc_i • • • Ak-mUk-m - {wk 
<Bfc_i • • • 'Bk-mq_k-m — {wk 



- Wk-m)e < qfc 

■ Wk-Tn)e 



From m, we have qfe(/) = r/(Vfe(/)) - Vfc(I) - 
(r/(Vfe(/)) - Vfe(/)) = for all k. By the assumption O, 
we have 

(1 - (5)minqfe_,„(i') - {wk ~ Wk-m) < qfc(«) 

r 

< (1 - ^) maxqfc-„i(i') - {wk - Wk~m)y i 

i' 

^ I minj/ qfc(i') > (1 - (5) miiii/ qfc-r„(i') - {wk - Wk-m) 
|maxj/ qfc(i') < (1 - 5) max^/ qfc-r„(i') - (wfc - Wk-m) 
=»maxqfc(i') - minqfe(i') 

i' i' 

< (1 - ^)(maxqA;-rn(^0 - Tninqk-mi^)) 

r 

^maxqfc(i') - minqfc(i') < 0^ TT (1 ~ ^3+irn) 



1=1 



where 



<j > 0. Since qt.. (/) = 0, we have maxi/qfe(i') > 
and mini'q;;(i') < 0. Thus, Vi, we have |qA:(j)| < 

maxi/ qfc(i') - min^/ qfe(i') < (t)j YliJi (1 ~ Sj+im). 

Therefore, as $k \to \infty$, $\mathbf{q}_k \to \mathbf{0}$, i.e. $\mathbf{V}_k$ satisfies the Bellman
equation ( fTsT l. By the Proposition 1 in Chapter 7 of [13], 
$J_\beta = T_I(\mathbf{V}_k) - V_k(I)$ is the optimal value and $\mathbf{V}_k$ is the potential vector, which is unique up to a constant vector. However, due to the property that $q_k(I) = 0\ \forall k \Rightarrow V_k(I) = V_0(I)\ \forall k$, we have the convergence of the potential vector $\mathbf{V}_\infty = \lim_{k \to \infty} \mathbf{V}_k$ and the optimal value $J_\beta^* = T_I(\mathbf{V}_\infty) - V_\infty(\mathbf{Q}^I)$ by the online value iteration via stochastic approximation algorithm. Accordingly, the optimal stationary policy is $\Omega^* = \arg\min T(\mathbf{V}_\infty)$.



Appendix C: Proof of Theorem 2

The solution of the Bellman equation in (11) can be obtained by offline relative value iteration [19]. Without loss of generality, we set $\mathbf{Q}^I$ as the reference state. Hence, we have the normalizing equation $V^l(\mathbf{Q}^I) = 0\ \forall l$. Assume $V^l(\mathbf{Q}) = \sum_{k=1}^{K} V_k^l(Q_k)\ \forall l$.

At the (/ — l)-th iteration, updating policy according to (fTST i 
is equivalent to finding the policy $\Omega^l$ that minimizes the objective function (20). Under any given CSI-only subcarrier allocation policy, the optimal power actions for the $l$-th iteration that minimize (20) are given by



pi_„(H,Qfe) = 4„(H)( 



Nf 



(Qk) 



k.n I 



^JiiiQk} = 4,„(H) log(l +pi,„(H, Q^m 



n=l 



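where $\bar{\mu}_k^l(Q_k)$ is the mean departure rate under $\mathbf{u}_k^l(Q_k) = \{\mathbf{u}_k^l(\mathbf{H}, Q_k) : (\mathbf{H}, Q_k)\ \forall \mathbf{H}\}$ with $\mathbf{u}_k^l(\mathbf{H}, Q_k) = (\mathbf{p}^l(\mathbf{H}, Q_k), \mathbf{s}^l(\mathbf{H}))$, and $\Delta V_k^l(Q_k) = V_k^l(Q_k) - V_k^l([Q_k - 1]^+)$ is the potential increment for the $k$-th queue.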

At the ^-th iteration, we determine the potential y'(Q) and 
6*' by solving the normalizing equation V^'(Q^) — together 



with / 



1) Poisson equations 



J2UQl,^i{Ql)) + v\Q') 



Theorem 1,^{13} almost sure convergence of the online value iteration algorithm in Corollary 1 is guaranteed.

Next, we have to show that the converged solution is indeed asymptotically optimal. Denote $k^* = \arg\max_k |H_{k,n}|^2$. For large $K$, $|H_{k^*,n}|^2$ grows with $\log(K)$ by extreme value theory. Because the traffic loading remains unchanged as we scale up $K$, $\max_{k,j} |\Delta_k V(\mathbf{Q}) - \Delta_j V(\mathbf{Q})| = O(1)$. Hence, $X_{k^*,n}$ grows like $\log(\log(K))$. As $K \to \infty$, $\Pr[k^* = \arg\max_k X_{k,n}] = 1$. Thus the subband allocation result of the optimal subband allocation in (22) and the best CSI subband allocation in (24) will be the same for large $K$. Thus, (24) and (25) are asymptotically optimal. Therefore, with (24), following the proof of Theorem 2, we can prove $\sum_k V_k(Q_k) \to V^*(\mathbf{Q})$, $\sum_k \theta_k \to \theta^*$ as $K \to \infty$, where $V^*(\mathbf{Q})$ and $\theta^*$ are the potential and optimal average reward under the global optimal power and subcarrier allocation given by Lemma 2.



fe 



\kTAV\Q^^ + 1) -Y,Tl\{Ql)rAV\Ql) 



References 



k 



\fl<i< I 



where 



vo < 



=gkiQKiQl)) + ^krAVi{Ql + l) 



(30) 



?fe < Nq (31) 
--piiQDrAVliQl) 



QkiQlAiQl)) ^ +inY.PkA^n.Ql)\Q 



There are $I$ joint $\mathbf{Q} = (Q_1, \cdots, Q_K)$ states, but there are only $N_Q + 1$ states for $Q_k\ \forall k$. Hence, in the original $(N_Q + 1)^K$ Poisson equations (30), there are only $N_Q + 1$ independent Poisson equations in (31) for the $k$-th ($1 \le k \le K$) queue. In addition, set $V_k^l(0) = 0\ \forall k$ as the individual normalizing equation, which also satisfies $V^l(\mathbf{Q}^I) = \sum_k V_k^l(0) = 0$. Hence, in the $l$-th iteration, we can obtain $\{V_k^l(Q_k), \theta_k^l\}$ by solving the $k$-th user's reduced state Poisson equation in (31) together with its individual normalizing equation. Accordingly, $\{V^l(\mathbf{Q}), \theta^l\}$ is the solution for the original $(N_Q + 1)^K$ Poisson equations (30), where $V^l(\mathbf{Q}) = \sum_k V_k^l(Q_k)$ and $\theta^l = \sum_k \theta_k^l$.



Continue the iterations until O 



nlP. We obtain 



$\{V_k(Q_k), \theta_k\}$, which is the solution of the $k$-th user's reduced Bellman equation in (23). Accordingly, $\{V(\mathbf{Q}), \theta\}$ is the solution of the original $(N_Q + 1)^K$ Bellman equations in (11), where $V(\mathbf{Q}) = \sum_k V_k(Q_k)$ and $\theta = \sum_k \theta_k$, which are the potential for the joint $\mathbf{Q}$ and the optimal average reward respectively.
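The additive property can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: a fixed policy is used (so the equations are Poisson equations rather than Bellman equations), at most one queue changes per slot, arrivals to a full buffer are dropped, and all function names and parameter values are hypothetical. It solves each user's Poisson equation separately and then measures the largest violation of the joint Poisson equation when $V(\mathbf{Q}) = \sum_k V_k(Q_k)$ and $\theta = \sum_k \theta_k$.

```python
from itertools import product


def solve_user_poisson(lam_tau, mu_tau, cost, n_q):
    """Solve one user's Poisson equation under a fixed policy.

    mu_tau[q] is the departure probability per slot in state q (mu_tau[0] is unused),
    cost[q] is the per-stage cost. Returns (theta, V) with V[0] = 0.
    """
    # Stationary distribution of the birth-death chain via detailed balance.
    pi = [1.0]
    for q in range(1, n_q + 1):
        pi.append(pi[-1] * lam_tau / mu_tau[q])
    theta = sum(p * c for p, c in zip(pi, cost)) / sum(pi)
    # Forward recursion on theta = cost[q] + lam_tau*dV(q+1) - mu_tau[q]*dV(q).
    V, dv = [0.0], 0.0
    for q in range(n_q):
        dv = (theta - cost[q] + (mu_tau[q] * dv if q > 0 else 0.0)) / lam_tau
        V.append(V[-1] + dv)
    return theta, V


def joint_residual(users, n_q):
    """Largest violation of the joint Poisson equation when V(Q) = sum_k V_k(Q_k)."""
    sols = [solve_user_poisson(lam, mu, c, n_q) for lam, mu, c in users]
    theta = sum(t for t, _ in sols)
    worst = 0.0
    for state in product(range(n_q + 1), repeat=len(users)):
        v_here = sum(V[q] for (_, V), q in zip(sols, state))
        rhs, stay = 0.0, 1.0
        for (lam, mu, c), (_, V), q in zip(users, sols, state):
            others = v_here - V[q]
            down = mu[q] if q > 0 else 0.0
            rhs += c[q]
            rhs += lam * (others + V[min(q + 1, n_q)])   # arrival (dropped if full)
            rhs += down * (others + V[max(q - 1, 0)])    # departure
            stay -= lam + down
        rhs += stay * v_here                              # no queue changes
        worst = max(worst, abs(theta + v_here - rhs))
    return worst


# Two hypothetical users with buffer size 4: (lambda*tau, mu*tau per state, cost per state).
users = [
    (0.10, [0.0, 0.30, 0.30, 0.30, 0.30], [0.0, 1.0, 2.0, 3.0, 4.0]),
    (0.05, [0.0, 0.20, 0.25, 0.30, 0.35], [0.0, 2.0, 4.0, 6.0, 8.0]),
]
residual = joint_residual(users, n_q=4)
```

The residual is at floating-point level, while the joint system has $(N_Q + 1)^K = 25$ equations and the decomposed one only $K(N_Q + 1) = 10$.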

Appendix D: Proof of Corollary 1 

Since $\{V_k(Q_k), \theta_k\}$ is the solution of the $k$-th user's reduced Bellman equation (23), the original MDP can be decoupled into $K$ individual MDPs with transition kernel given by $f(Q_k + 1|Q_k, \mathbf{u}_k(Q_k)) = \lambda_k \tau$, $f(Q_k - 1|Q_k, \mathbf{u}_k(Q_k)) = \bar{\mu}_k(Q_k)\tau$ and $f(Q_k|Q_k, \mathbf{u}_k(Q_k)) = 1 - \lambda_k \tau - \bar{\mu}_k(Q_k)\tau$. Therefore, the online value iteration can be applied to the $K$ individual MDPs respectively. Under the same condition of



[1] K. Seong, M. Mohseni, and J. Cioffi, "Optimal resource allocation for OFDMA downlink systems," in IEEE Int. Symp. Inform. Theory (ISIT), July 2006, pp. 1394-1398.
[2] C. Y. Wong, R. S. Cheng, K. B. Letaief, and R. D. Murch, "Multiuser OFDM with adaptive subcarrier, bit, and power allocation," IEEE J. Select. Areas Commun., vol. 17, no. 10, pp. 1747-1758, Oct. 1999.
[3] D. Wu and R. Negi, "Effective Capacity: A Wireless Link Model for Support of Quality of Service," IEEE Trans. Wireless Commun., vol. 2, pp. 630-643, July 2003.
[4] D. Hui and V. Lau, "Cross-Layer Design for OFDMA Wireless Systems With Heterogeneous Delay Requirements," IEEE Trans. Wireless Commun., vol. 6, pp. 2872-2880, Aug. 2007.
[5] J. Tang and X. Zhang, "Quality-of-Service Driven Power and Rate Adaptation over Wireless Links," IEEE Trans. Wireless Commun., vol. 6, pp. 3059-3068, Aug. 2007.
[6] E. M. Yeh, "Multiaccess and Fading in Communication Networks," Ph.D. dissertation, MIT, September 2001.
[7] E. M. Yeh and A. Cohen, "Throughput and delay optimal resource allocation in multiaccess fading channels," in Proc. ISIT, June-July 2003, p. 245.
[8] L. Georgiadis, M. J. Neely, and L. Tassiulas, "Resource allocation and cross-layer control in wireless networks," Foundations and Trends in Networking, vol. 1, no. 1, pp. 1-144, 2006.
[9] M. J. Neely, "Order optimal delay for opportunistic scheduling in multi-user wireless uplinks and downlinks," IEEE/ACM Trans. Networking, vol. 16, no. 5, pp. 1188-1199, 2008.
[10] W. Luo and A. Ephremides, "Stability of N interacting queues in random-access systems," IEEE Trans. Inform. Theory, vol. 45, no. 5, pp. 1579-1587, July 1999.
[11] A. Stolyar, "Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm," Queueing Systems, vol. 50, no. 4, pp. 401-457, 2005.
[12] X. Wang, G. B. Giannakis, and A. G. Marques, "A unified approach to QoS-guaranteed scheduling for channel-adaptive wireless networks," Proceedings of the IEEE, vol. 95, no. 12, pp. 2410-2431, 2007.
[13] D. P. Bertsekas, "Dynamic programming - deterministic and stochastic models," Prentice Hall, New Jersey, NJ, USA, 1987.
[14] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach, 1st ed. New York: Springer, 2007.
[15] D. P. Palomar and M. Chiang, "A tutorial on decomposition methods for network utility maximization," IEEE J. Select. Areas Commun., vol. 24, no. 8, pp. 1439-1451, Aug. 2006.
[16] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, 1st ed. Berlin: Birkhäuser Verlag Basel, 1992.
[17] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. New York: Springer, 2003.
[18] L. Kleinrock, Queueing Systems. Volume I: Theory, 1st ed. New York: Wiley Interscience, 1975, ch. 2.

^{13} It can be easily verified that all states in the birth-death chain are accessible.



IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 9, NO. 1, JANUARY 2010 






[19] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Massachusetts: Athena Scientific, 2007.
[20] L. M. C. Hoo, B. Halder, J. Tellado, and J. M. Cioffi, "Multiuser transmit optimization for multicarrier broadcast channels," IEEE Trans. Commun., vol. 52, no. 6, June 2004.
[21] W. Yu and J. M. Cioffi, "FDMA capacity of Gaussian multiple-access channels with ISI," IEEE Trans. Commun., vol. 50, no. 1, Jan. 2002.
[22] M. Andrews, K. Kumaran, K. Ramanan, A. Stolyar, P. Whiting, and R. Vijayakumar, "Providing quality of service over a shared wireless link," IEEE Commun. Mag., vol. 39, no. 2, pp. 150-154, Feb. 2001.




Vincent K. N. Lau obtained B.Eng (Distinction 1st 
Hons) from the University of Hong Kong in 1992 
and Ph.D. from Cambridge University in 1997. He 
was with PCCW as system engineer from 1992-1995 
and Bell Labs - Lucent Technologies as member of 
technical staff from 1997-2003. He then joined the 
Department of ECE, HKUST as Associate Professor. 
His current research interests include the robust and 
delay-sensitive cross-layer scheduling, cooperative 
and cognitive communications as well as stochastic 
approximation and Markov Decision Process. 




Ying Cui received the B.Eng degree (first class honors)
in Electronic and Information Engineering from Xi'an
Jiaotong University, Xi'an, China in 2007. She
is currently a Ph.D candidate in the Depart- 
ment of Electronic and Computer Engineering, the 
Hong Kong University of Science and Technology 
(HKUST). Her current research interests include 
cooperative and cognitive communications, delay- 
sensitive cross-layer scheduling as well as stochastic 
approximation and Markov Decision Process. 



